In [1]:
from sklearn.datasets import load_boston
import sklearn.ensemble
import numpy as np
from sklearn.model_selection import train_test_split
import lime
import lime.lime_tabular
Let's load the sklearn dataset called 'boston'. It contains Boston-area house prices and is often used for machine learning regression examples.
In [2]:
boston = load_boston()
In [3]:
# Take a look at the description of the dataset to get familiar with it.
print(boston['DESCR'])
In [4]:
# Now, let's take a look at the feature names.
print(boston['feature_names'])
In [5]:
# Now... the data.
print(boston['data'])
Now that we have our data loaded, we want to build a regression model to forecast Boston housing prices. We'll use a random forest for this.
First, we'll set up the RF model, then create our training and test data using the train_test_split function from sklearn, and finally fit the model.
In [6]:
rf = sklearn.ensemble.RandomForestRegressor(n_estimators=1000)
train, test, labels_train, labels_test = train_test_split(boston.data, boston.target, train_size=0.80)
rf.fit(train, labels_train)
Out[6]:
Now that we have a Random Forest Regressor trained, we can check some of the accuracy measures.
In [7]:
print('Random Forest MSError', np.mean((rf.predict(test) - labels_test) ** 2))
In [8]:
print('MSError when predicting the mean', np.mean((labels_train.mean() - labels_test) ** 2))
We can see our errors are generally ok. Note that the target is the median home value in units of $1,000 (values range roughly from 5 to 50), so these squared errors correspond to being off by only a few thousand dollars per house.
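To make that concrete, here is a quick sketch (not in the original notebook) that converts the squared errors back into the target's units via the RMSE:

# Sketch: RMSE is in the same units as the target ($1,000s),
# so it is easier to read than the raw squared error above.
rmse_rf = np.sqrt(np.mean((rf.predict(test) - labels_test) ** 2))
rmse_baseline = np.sqrt(np.mean((labels_train.mean() - labels_test) ** 2))
print('Random Forest RMSE:', rmse_rf)
print('Mean-baseline RMSE:', rmse_baseline)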
In [9]:
# LIME's tabular explainer needs to know which features are categorical.
# Here we flag any column with 10 or fewer distinct values.
categorical_features = np.argwhere(
    np.array([len(set(boston.data[:, x]))
              for x in range(boston.data.shape[1])]) <= 10).flatten()
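As a quick check (an addition to the original notebook), we can print which columns that heuristic actually flagged:

# Sketch: inspect the columns treated as categorical by the heuristic above.
print(boston.feature_names[categorical_features])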
In [10]:
explainer = lime.lime_tabular.LimeTabularExplainer(
    train,
    feature_names=boston.feature_names,
    class_names=['price'],
    categorical_features=categorical_features,
    verbose=True,
    mode='regression')
In [16]:
i = 100
exp = explainer.explain_instance(test[i], rf.predict, num_features=5)
exp.show_in_notebook(show_table=True)
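If you need the explanation outside the notebook widget, the Explanation object also exposes the underlying feature weights; a minimal sketch using lime's as_list():

# Sketch: as_list() returns the (feature rule, weight) pairs behind the plot.
for feature, weight in exp.as_list():
    print(feature, weight)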
In [42]:
len(test)
Out[42]: